Processing method for high order tensor data

ABSTRACT

A processing method for high-order tensor data, which can avoid that the vectorization process of the image observation sample set damage the internal structure of the data, simplify the large amount of redundant information in the high-order tensor data in the image observation sample set, and improve the image processing speed. In this processing method for high-order tensor data, the high-order tensor data are divided into three parts: the shared subspace component, the personality subspace component and the noise part; the shared subspace component and the personality subspace component respectively represent the high-order tensor data as a group of linear combination of the tensor base and the vector coefficient; the variational EM method is used to solve the base tensor and the vector coefficient; design a classifier to classify the test samples by comparing the edge distribution of samples.

CROSS REFERENCE TO THE RELATED APPLICATIONS

This application is based upon and claims priority to Chinese PatentApplication No. CN201910022095.X, filed on Jan. 10, 2019, the entirecontents of which are incorporated herein by reference.

TECHNICAL FIELD

The present invention relates to the technical field of data processing,and particularly to a processing method for high order tensor data,which is mainly used for the processing of video sequence, RGB-Dsequence and other data.

BACKGROUND

With the development of science and technology, a large number ofhigh-order tensor data emerge, such as video sequence, RGB-D sequenceand so on. Because there are a lot of redundant information inhigh-order tensor data, it is often necessary to extract features orreduce dimensions before data analysis.

The traditional way of dimensionality reduction is to quantize thetensor data, and then use the vector dimensionality reduction method.The process of dimensionality will destroy the structure of the data,and can not make use of the internal structure of high-order tensordata. In addition, two commonly used dimensionality reduction methodsfor higher-order tensors are CandeComp/PARAFAC (CP) decomposition andTucker decomposition. The feature dimension extracted by CPdecomposition is determined by the rank of CP decomposition, which isnot flexible. Tucker decomposition is more complex to extract featuresfrom test samples, so it is more suitable for clustering thanclassification.

SUMMARY

The technical problem addressed by the present invention is to overcomethe deficiency in the prior art, and to provide a processing method forhigh-order tensor data, which can avoid that the vectorization processof the image observation sample set damage the internal structure of thedata, which is flexible and simple, which can extract the discriminantfeatures of the tensor sample, simplify the large amount of redundantinformation in the high-order tensor data in the image observationsample set, and improve the image processing speed.

The technical solution of the present invention is that, in thisprocessing method for high-order tensor data, the high-order tensor dataare divided into three parts: the shared subspace component, thepersonality subspace component and the noise part; the shared subspacecomponent and the personality subspace component respectively representthe high-order tensor data as a group of linear combination of thetensor base and the vector coefficient; the variational EM method isused to solve the base tensor and the vector coefficient; design aclassifier to classify the test samples by comparing the edgedistribution of samples.

Compared with the traditional linear discriminant method, the presentinvention extracts the common feature and the individual featuredirectly from the tensor data in two spaces respectively. In each space,by the structural feature of the tensor, it constructs a representationway from the tensor data to the vector data, so as to extract thediscriminant feature of the tensor sample, thus avoiding that thevectorization process of the image observation sample set damage theinternal structure of the data, which is flexible and simple, which canextract the discriminant features of the tensor sample, simplify thelarge amount of redundant information in the high-order tensor data inthe image observation sample set, and improve the image processingspeed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic diagram of formula (1) according to theinvention.

FIG. 2 shows a graphic illustration of probabilistic recognition methodaccording to the present invention.

FIG. 3 shows a visual effect of the reconstruction of the invention onthe extended Yale database.

FIG. 4 shows reconstructed images of the invention on AR database.

DETAILED DESCRIPTION OF THE EMBODIMENTS

As shown as FIG. 1, in this processing method for high-order tensordata, the high-order tensor data are divided into three parts: theshared subspace component, the personality subspace component and thenoise part; the shared subspace component and the personality subspacecomponent respectively represent the high-order tensor data as a groupof linear combination of the tensor base and the vector coefficient; thevariational EM method is used to solve the base tensor and the vectorcoefficient; design a classifier to classify the test samples bycomparing the edge distribution of samples.

Compared with the traditional linear discriminant method, the presentinvention extracts the common feature and the individual featuredirectly from the tensor data in two spaces respectively. In each space,by the structural feature of the tensor, it constructs a representationway from the tensor data to the vector data, so as to extract thediscriminant feature of the tensor sample, thus avoiding that thevectorization process of the image observation sample set damage theinternal structure of the data, which is flexible and simple, which canextract the discriminant features of the tensor sample, simplify thelarge amount of redundant information in the high-order tensor data inthe image observation sample set, and improve the image processingspeed.

Preferably, two dimensional model is assuming an observed data sety={Y_(ij)}_(i=1,j=1) ^(I,N) ^(i) , which consist of N images of Iindividuals and each individual has Ni images. Yij represents the jthimage of the ith individual and it will be modeled as formula (1):Y _(ij)=

×₃ h _(i) ^(T)+

×₃ g _(ij) ^(T) +M+E _(ij)  (1)In this model,

×₃ h_(i) ^(T) or

×₃ g_(ij) ^(T) can be regarded as the generalization of vectorrepresentation for a given sample,

and

are the factor tensors consisting of base matrices in shared subspaceand individual subspace, respectively. ×₃ represents the product oftensor and matrix (vector), h_(i)∈

^(K) ¹ is a shared latent variable crossing all the images Y_(i1), . . ., Y_(iN) _(i) of object/class i and g_(ij)∈

^(K) ² represents the coefficient of the jth sample of the ith object inindividual space. M is the mean matrix of the data, suppose M=0. E_(ij)is a stochastic noise term, each element in which follows the normaldistribution

(0, σ²).

Preferably, the vector of model (1) is formula (2)vec(Y _(ij))=Fh _(i) +Wg _(ij)+vec(M)+vec(E _(ij))  (2)in which, the columns and f_(k) ₁ and w_(k) ₂ in matrix F and W isrespectively k1, k2 face of tensor

and

, are used to restrict the base tensor by CP decomposition of tensor.Assuming

$\mathcal{F} = {{〚{{\lambda;F^{(1)}},F^{(2)},F^{(h)}}〛} = {\sum\limits_{r = 1}^{R_{1}}{\lambda_{r}f_{1,r}^{(1)}{\bullet f}_{1,r}^{(2)}{\bullet f}_{1,r}^{(h)}}}}$and$\mathcal{W} = {{〚{{\eta;W^{(1)}},W^{(2)},W^{(g)}}〛} = {\sum\limits_{r = 1}^{R_{2}}{\eta_{r}w_{1,r}^{(1)}\bullet\; w_{1,r}^{(2)}\bullet\;{w_{1,r}^{(g)}.}}}}$

Preferably, assuming Q(h, g, ρ) is any joint distribution over thehidden variables, find that a lower bound on likelihood of the observeddata is formula (2),

$\begin{matrix}{\begin{matrix}{{\mathcal{L}( \ominus )} = {\log\;{p\left( {\mathcal{Y}❘ \ominus} \right.}}} \\{\geq \,{\int_{h}{\int_{g}{\int_{\rho}{{\mathcal{Q}\left( {h,g,\rho} \right)}\log\;\frac{p\left( {\mathcal{Y},h,g,\rho} \right)}{\mathcal{Q}\left( {h,g,\rho} \right)}{dhdgd}\;\rho}}}}} \\{= {{E_{h,g,\rho}\left\lbrack {\log\;{p\left( {{\mathcal{Y}❘h},g,\rho} \right)}} \right\rbrack} + {E_{h}\left\lbrack {\log\;\frac{p(h)}{\mathcal{Q}(h)}} \right\rbrack}}} \\{{+ {E_{g}\left\lbrack {\log\;\frac{p(g)}{\mathcal{Q}(g)}} \right\rbrack}} + {E_{\rho}\left\lbrack {\log\;\frac{p(\rho)}{\mathcal{Q}(\rho)}} \right\rbrack}} \\{\overset{\Delta}{=}{\hat{\mathcal{L}}\left( {{\mathcal{Q}(h)},{\mathcal{Q}(g)},{\mathcal{Q}(\rho)}, \ominus} \right)}}\end{matrix}\quad} & (3)\end{matrix}$The second equation of the formula (3) is based on the feature ofseparation of Q(h, g, ρ), Q(h, g, ρ)=Q(h)Q(g)Q(ρ).

Preferably, assuming a high-order observed data set {y_(ij)|y_(ij)∈

^(D) ¹ ^(× . . . ×D) ^(M) ; i=1, . . . , I, j=1, . . . , N_(i),Σ_(i)N_(i)=N}.

each sample is a tensor of order M, and the linear discriminant analysismodel of higher-order tensor is formula (4)y _(ij)=

×_((M+1)) h _(i) ^(T)+

×_((M+1)) g _(ij) ^(T)+

_(ij)  (4)Wherein h_(i)∈

^(K) ¹ , g_(ij)∈

^(K) ² ,

and

is a tensor of order (M+1), whose size is respectively D₁× . . .×D_(M)×K₁ and D₁× . . . ×D_(M)×K₂. Assuming base tensor has thestructure of CP decomposition,

$\begin{matrix}{\mathcal{F} = {〚{F^{(1)},F^{(2)},\ldots\mspace{14mu},F^{(M)},F^{(h)}}〛}} \\{= {\sum\limits_{r = 1}^{R_{1}}{f_{1,r}^{(1)}\bullet\; f_{1,r}^{(2)}\bullet\mspace{14mu}\ldots\mspace{14mu}\bullet\; f_{1,r}^{(M)}\bullet\; f_{1,r}^{(h)}}}}\end{matrix}$ and $\begin{matrix}{\mathcal{W} = {〚{W^{(1)},W^{(2)},\ldots\mspace{14mu},W^{(M)},W^{(g)}}〛}} \\{= {\sum\limits_{r = 1}^{R_{2}}{w_{1,r}^{(1)}\bullet\; w_{1,r}^{(2)}\bullet\mspace{14mu}\ldots\mspace{14mu}\bullet\; w_{1,r}^{(M)}\bullet\;{w_{1,r}^{(g)}.}}}}\end{matrix}$

Preferably, for higher-order model, EM algorithm is used to solve theproblem, therefore the posterior of h_(i) is still a multivariate normaldistribution. The mean and covariance matrix are, respectively,

$\begin{matrix}{{\overset{\_}{h}}_{i} = {\left( {\sum\limits_{\mathcal{F}}{{+ \frac{1}{\overset{\_}{\rho}\; N_{i}}}I_{K_{1}}}} \right)a_{i}}} \\{\sum\limits_{h}{= \left( {I_{{K\; 1}\;} + {\frac{\overset{\_}{\rho}\; N}{I}\sum\limits_{\mathcal{F}}}} \right)^{- 1}}}\end{matrix}$where,

=F ^((h))[(F ^((m)T) F ^((m)))⊙( F ^((m)T) F ^((m)]) F ^((h)T)a _(i) =F ^((h))diag(F ^((m)T) Y _((m)) ^(i) F ^((m))).Where ⊙ represents the Hadamard elementwise product.Define

${{\overset{\_}{\mathcal{Y}}}_{i} = {\frac{1}{N_{i}}{\sum\limits_{j = 1}^{N_{i}}\left( {\mathcal{Y}_{ij} - {W \times_{({M + 1})}g_{ij}}} \right)}}},$then Y _((m)) ^(i) represents the mode-m metricized version of y _(i),andF ^((m)) =F ^((M))

. . .

F ^((m+1)) F ^((m−1))

. . .

F ⁽¹⁾Where

represents the Khatri-Rao product.By computing, the posterior of g _(ij) is still a multivariate Gaussdistribution, and the mean and covariance of latent variable g_(ij) havebecomeg _(ij)=(

+1/ρI _(K))⁻¹ b _(ij)Σ_(g)=(I _(K)+ρ

)⁻¹whereb _(ij) =W ^((g))diag(W ^((m)T) Ŷ _((m)) ^(ij) W ^((m)))Σ_(W) =W ^((g))[(W ^((m)T) W ^((m)))⊙( W ^((m)T) W ^((m)))]W ^((g)T)Let ŷ_(ij)=y_(ij)−

×_((M+1))h_(i) ^(T), and Ŷ_((m)) ^(ij) represents the mode-m metricizedversion of ŷ_(ij). AndW ^((m)) =W ^((M))

. . .

W ^((m+1))

W ^((m−1))

. . .

W ⁽¹⁾The optimal Q*(ρ) is still a Gamma distribution.In M-step, taking derivatives of

with respect to F^((n)) (n=1, . . . , N) and F^((h)), and letting thembe zero, obtain

$F^{(m)} = {\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\overset{\sim}{Y}}_{(m)}^{ij}{\overset{\_}{F}}^{(m)}{{Diag}\left( {h_{i}^{T}F^{(h)}} \right)}}}} \right\rbrack\left\lbrack {\left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\_}{F}}^{(m)}} \right) \odot \left( {{{NF}^{{(h)}T}{\sum\limits_{h}F^{(h)}}} + {F^{{(h)}T}\overset{\sim}{H}{\overset{\sim}{H}}^{T}F^{(h)}}} \right)} \right\rbrack}^{- 1}$$F^{(h)} = {{{\left\lbrack {{\overset{\sim}{H}{\overset{\sim}{H}}^{T}} + {N\;\sum\limits_{h}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{h_{i}\left\lbrack {{diag}\left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\sim}{Y}}_{(m)}^{{({ij})}T}F^{(m)}} \right)} \right\rbrack}^{T}}} \right\rbrack}\left\lbrack {\left( {F^{{(m)}T}F^{(m)}} \right) \odot \left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\_}{F}}^{(m)}} \right)} \right\rbrack}^{- 1}.}$Denote {tilde over (Y)}_((m)) ^(ij) presents the mod-m metricizedversion of {tilde over (y)}_(ij). And {tilde over (y)}_(ij)=y_(ij)−

×_((M+1))g_(ij),Taking derivatives of

with respect to W^((m)) (m=1, . . . , M) and W^((h)), and letting thembe zero, obtain

$W^{(m)} = {{\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\overset{\sim}{Y}}_{(m)}^{ij}{\overset{\_}{W}}^{(m)}{{Diag}\left( {g_{ij}^{T}W^{(g)}} \right)}}}} \right\rbrack\left\lbrack {\left( {{\overset{\_}{W}}^{{(m)}T}{\overset{\_}{W}}^{(m)}} \right) \odot \left( {{{NW}^{{(g)}T}{\sum\limits_{g}W^{(g)}}} + {W^{{(g)}T}{GG}^{T}W^{(g)}}} \right)} \right\rbrack}^{- 1}\mspace{14mu}{and}}$$W^{(h)} = {{{\left\lbrack {{GG}^{T} + {N\;\sum\limits_{g}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{g_{ij}\left\lbrack {{diag}\left( {{\overset{\_}{W}}^{{(m)}T}{\overset{\sim}{Y}}_{(m)}^{{({ij})}T}W^{(m)}} \right)} \right\rbrack}^{T}}} \right\rbrack}\left\lbrack {\left( {W^{{(m)}T}W^{(m)}} \right) \odot \left( {{\overset{\_}{W}}^{{(m)}T}{\overset{\_}{W}}^{(m)}} \right)} \right\rbrack}^{- 1}.}$

Preferably, for a sample Y_(p) to be classified, its category bycalculating the likelihood under a model. The key point of recognitionis that samples in the same class share the same identity variable.Assuming Y_(1 . . . N) is observed samples, if Y_(p) and one particularYi belong to the same category, they shall share the same h.

Preferably, Y1 and Yp belongs to the same category and both of identityvariables are h1. In this case, the likelihood of observed samples andYp as p₁(Y₁, Y₂, Y_(p)), Yp and Y2 came from the same category with theidentity variable h2 and the likelihood can be represented as p₂(Y₁, Y₂,Y_(p)). Then we compare p₁ and p₂, and find out the largest likelihoodvalue. Thus, Yp belongs to the sample's label corresponding to thelargest likelihood. The likelihood of samples can be calculated asfollows.

Because Y_(ij) follows a matrix-variate normal conditioned on the latentvariables h and g. For convenience, we work with p(y_(ij)) instead ofp(Y_(ij)). Here, we note y_(ij):=vec(Y_(ij)). Thus, the proposed model(1) can becomey _(ij) =Fh _(i) +Wg _(ij) +e _(ij)Where F=(F⁽²⁾⊙F⁽¹⁾)F^((h)T) and W=(W⁽²⁾⊙W⁽¹⁾)W^((g)T)These factor matrices have been learned in training phrase, which areknown to the recognition processing. Thus, the likelihood of N equals

$\begin{matrix}{{p_{n}\left( y_{{1\mspace{14mu}\ldots\mspace{14mu} N},p} \right)} = {\prod\limits_{{i = 1},{i \neq n}}^{N}\;{{p\left( y_{i} \right)}{p\left( {y_{p},y_{n}} \right)}}}} & (20)\end{matrix}$Next, we illustrate the marginal distribution of n images y_(1 . . . n)that share the same identity variable h. The generative equations become

$\begin{matrix}{\begin{bmatrix}y_{1} \\y_{2} \\\vdots \\y_{n}\end{bmatrix} = {{\begin{bmatrix}F & W & 0 & \ldots & 0 \\F & 0 & W & \ldots & 0 \\\vdots & \vdots & \vdots & \ddots & \vdots \\F & 0 & 0 & \ldots & W\end{bmatrix}\begin{bmatrix}h \\g_{1} \\g_{2} \\\vdots \\g_{n}\end{bmatrix}} + \begin{bmatrix}e_{1} \\e_{2} \\\vdots \\e_{n}\end{bmatrix}}} & (5)\end{matrix}$noting,y′=Az′+m′+e′Taking the following probabilistic assumptionp(z′)=

(0,I)andp(e′)=

(0,σ² I)Converts the model (5) into the form of probabilistic PCA. Thus, themarginal distribution over the observed y′ can be easily obtained byintegrating out the latent variables asp(y′)=

(m′,AA ^(T)+σ² I)  (21)By (21), both p(y_(i)) and p(y_(p), y_(n)) in (20) can be calculated andthen we can obtain P_(n)(y_(1, . . . , N,p)). Comparing allp_(n)(y_(1, . . . , N,p)) and finding out the largest value, then thesample Yp is divided into the sample's label corresponding to thelargest p_(n)(y_(1, . . . , N,p)).

The invention is described in more detail below.

The technical solution adopted by the invention is a probabilisticlinear discriminant analysis method based on tensor data vectorexpression, and the specific implementation process of the method is asfollows.

Two Dimensional Model Construction

Assuming an observed data set y={Y_(ij)}_(i=1,j=1) ^(I,N) ^(i) , whichconsist of N images of I individuals and each individual has Ni images(Σ_(i=1) ^(I)N_(i)=N). Yij represents the jth image of the ithindividual and it will be modeled as formula (1):Y _(ij)=

×₃ h _(i) ^(T)+

×₃ g _(ij) ^(T) +M+E _(ij)  (1)

In this model,

×₃ h_(i) ^(T) or

×₃ g_(ij) ^(T) can be regarded as the generalization of vectorrepresentation for a given sample, in which coefficient is vector, asshown as FIG. 1.

and

are the factor tensors consisting of base matrices in shared subspaceand individual subspace, respectively. ×₃ represents the product oftensor and matrix (vector). h_(i)∈

^(K) ¹ is a shared latent variable crossing all the images Y_(i1), . . ., Y_(iN) _(i) of object/class i and g_(ij)∈

^(K) ² represents the coefficient of the jth sample of the ith object inindividual space. M is the mean matrix of the data. Without loss ofgenerality, we suppose M=0 (a zero matrix). E_(ij) is a stochastic noiseterm, each element in which follows the normal distribution

(0, σ²).

Furthermore, we impose a prior of multivariate normal distribution onlatent variables to generate a Bayesian model

${p\left( h_{i} \right)} = {{\mathcal{N}\left( {{h_{i}❘0},I_{K_{1}}} \right)} = {\left( \frac{1}{2\;\pi} \right)^{K_{1}/2}\exp\left\{ {{- \frac{1}{2}}h_{i}^{T}h_{i}} \right\}}}$and${p\left( g_{ij} \right)} = {{\mathcal{N}\left( {{g_{ij}❘0},I_{K_{2}}} \right)} = {\left( \frac{1}{2\pi} \right)^{K_{2}/2}\exp\left\{ {{- \frac{1}{2}}g_{ij}^{T}g_{ij}} \right\}}}$

Letting

${\rho = \frac{1}{\sigma^{2}}},$we assume σ satisfies the following Gamma prior:

${p_{\sigma}(\rho)} = {{\Gamma\left( {{\rho ❘a},b} \right)} = {\frac{b^{a}}{\Gamma(a)}\rho^{a - 1}\exp\left\{ {{- b}\;\rho} \right\}}}$

After vectorizing model (1), we can obtain the following model:vec(Y _(ij))=Fh _(i) +Wg _(ij)+vec(M)+vec(E _(ij))  (2)

in which, the columns f_(k) ₁ and w_(k) ₂ in matrix F and W isrespectively k1, k2 face of tensor

and

, are used to restrict the base tensor by CP decomposition of tensor.Therefore, the model is transformed into the traditional probabilisticlinear discriminant analysis. However, a major problem is that theparameters in the model will increase with the increase of the originaldata. For example, for N-order tensor data, the projections

and

are (N+1)-order tensors with K₁Π_(n)D_(n) and K₂Π_(n)D_(n) components,respectively. They will quickly reaches billions when the modedimensionalities K₁, K₂, D_(1:N) and N are moderate. These make itextremely difficult to learn parameters for base tensors. Hence, wepropose to introduce tensor CP decompositions in constructing themultiplicative interactions in tensors

and

, respectively. Assuming

$\mathcal{F} = {{〚{{\lambda;F^{(1)}},F^{(2)},F^{(h)}}〛} = {\sum\limits_{r = 1}^{R_{1}}{\lambda_{r}f_{1,r}^{(1)}\bullet\; f_{1,r}^{(2)}\bullet\; f_{1,r}^{(h)}}}}$and$\mathcal{W} = {{〚{{\eta;W^{(1)}},W^{(2)},W^{(g)}}〛} = {\sum\limits_{r = 1}^{R_{2}}{\eta_{r}w_{1,r}^{(1)}\bullet\; w_{1,r}^{(2)}\bullet\;{w_{1,r}^{(g)}.}}}}$

Solution of the Model

For the above-mentioned model, h, g and ρ are the model latent variablesand θ={F⁽¹⁾, F⁽²⁾, F^((h)), W⁽¹⁾, W⁽²⁾, W^((g))} represents all themodel parameters. Therefore, the joint distribution of a set of givenobservations y and all the latent variables can be expressed as

${p\left( {\mathcal{Y},h,g,{\rho ❘ \ominus}} \right)} = {\prod\limits_{i,j}\;{\mathcal{N}\left( {{{Y_{i} - {\mathcal{F} \times {{}_{}^{}{}_{}^{}}} - {\mathcal{W} \times {{}_{}^{}{}_{}^{}}}}❘0},{\sigma\; I},{\sigma\; I}} \right)}}$𝒩(h_(i)❘0, I_(K₁))𝒩(g_(ij)❘0, I_(K₂))p_(σ)(ρ).

So the likelihood of the sample is

(θ)=log p(y|θ)=log ∫∫∫p(y,h,g,ρ|θ)dhdgdρ

Furthermore, we investigate the variational electromagnetic (EM)algorithm for the proposed TPLDA model.

As the proposed model depends on unobserved latent variables, we need tofind a lower bound on likelihood of the observed data for anyvariational distribution Q(h, g, ρ) over the hidden variables.

$\begin{matrix}{{\mathcal{L}( \ominus )} = {{{\log\;{p\left( {\mathcal{Y}❘ \ominus} \right)}} \geq {\int_{h}\ {\int_{g}{\int_{\rho}{{\mathcal{Q}\left( {h,g,\rho} \right)}\log\;\frac{p\left( {\mathcal{Y},h,g,\rho} \right)}{\mathcal{Q}\left( {h,g,\rho} \right)}{dhdgdp}}}}}} = {{{E_{h,g,\rho}\left\lbrack {\log\;{p\left( {{\mathcal{Y}❘h},g,\rho} \right)}} \right\rbrack} + {E_{h}\left\lbrack {\log\;\frac{p(h)}{\mathcal{Q}(h)}} \right\rbrack} + {E_{g}\left\lbrack {\log\;\frac{p(g)}{\mathcal{Q}(g)}} \right\rbrack} + {E_{p}\left\lbrack {\log\;\frac{p(\rho)}{\mathcal{Q}(\rho)}} \right\rbrack}}\overset{\Delta}{=}{\hat{\mathcal{L}}\left( {{\mathcal{Q}(h)},{\mathcal{Q}(g)},{\mathcal{Q}(\rho)}, \ominus} \right)}}}} & (3)\end{matrix}$

By taking a separable Q(h, g, ρ), that is Q(h, g, ρ)=Q(h)Q(g)Q(ρ), thesecond equation can be deduced. The variational EM algorithm can beimplemented as follows.

-   -   1) Variational E-Step: with fixing the current parameter value        θ, Q-distributions of latent variables are calculated in E-Step.    -   (i). Updating Q-Distribution of h_(i): From model (1), we can        see that        and g_(ij) are constant with respect to h_(i); hence, we can        rewrite the model (1) as

${\sum\limits_{j = 1}^{N_{i}}\left( {Y_{ij} - {\mathcal{W} \times {{}_{}^{}{}_{}^{}}}} \right)} = {{N_{i}\mathcal{F} \times {{}_{}^{}{}_{}^{}}} + {\sum\limits_{j = 1}^{N_{i}}E_{ij}}}$

Define

${\overset{\_}{Y}}_{i} = {{\frac{1}{N_{i}}{\sum\limits_{j = 1}^{N_{i}}{\left( {Y_{ij} - {W \times {{}_{}^{}{}_{}^{}}}} \right)\mspace{14mu}{and}\mspace{14mu}{\overset{\_}{E}}_{i}}}} = {\frac{1}{N_{i}}{\sum\limits_{j = 1}^{N_{i}}{E_{ij}.}}}}$Then, the above-mentioned equation becomesY _(i) =

×₃ h _(i) ^(T)+ E _(i)

And each element in Ē_(i) follows the normal distribution

$\mathcal{N}\left( {0,\frac{\sigma^{2}}{N_{i}}} \right)$with zero mean.

In (3), the last two expectation terms are constant with respect toh_(i), hence we only need consider computing the first two expectations.The likelihood of observed data p(y|h, g, ρ) in (3) can be obtainedusing the following equation.

${p\left( {{\mathcal{Y}❘h},g,\rho} \right)} = {\prod\limits_{i = 1}^{I}\;{\mathcal{N}\left( {{{\overset{\_}{Y}}_{i}❘{\mathcal{F} \times {{}_{}^{}{}_{}^{}}}},\frac{\sigma^{2}}{N_{i}}} \right)}}$

Therefore, the posterior of h_(i) is still a multivariate normaldistribution. The mean and covariance matrix are, respectively,

${\overset{\_}{h}}_{i} = {\left( {\sum\limits_{\mathcal{F}}{{+ \frac{1}{\overset{\_}{\rho}\; N_{i}}}I_{K_{1}}}} \right)a_{i}}$and$\sum\limits_{h}{= \left( {I_{K\; 1} + {\frac{\overset{\_}{\rho}\; N}{I}\Sigma_{\mathcal{F}}}} \right)^{- 1}}$

wherea _(i) =F ^((h))diag(F ^((1)T) Y _(i) F ⁽²⁾)and

=F ^((h))[(F ^((1)T) F ⁽¹⁾)⊙(F ^((2)T) F ⁽²⁾)]F ^((h)T)

Where ⊙ represents the Hadamard elementwise product.

ii) Updating Q-Distribution of g_(ij):

we rewrite model (1) as,Y _(ij)−

×₃ h _(i) ^(T)=

×₃ g _(ij) ^(T) +E _(ij)

Define Ŷ_(ij)=Y_(ij)−

×₃ h_(i) ^(T). Then the above-mentioned model has becomeŶ _(ij)=

×₃ g _(ij) ^(T) +E _(ij)

Thus, the posterior of g_(ij) is still a multivariate normaldistribution. The mean and covariance matrix are, respectively,g _(ij)=(

+1/ρI _(K))⁻¹ b _(ij)andΣ_(g)=(I _(K)+ρ

)⁻¹Whereb _(ij) =W ^((g))diag(W ^((1)T) Ŷ _(ij) W ⁽²⁾)andΣ_(W) =W ^((g))[(W ^((1)T) W ⁽¹⁾)⊙(W ^((2)T) W ⁽²⁾)]W ^((g)T)

iii) Updating the posterior of ρ:

utilizing Bayesian Theory, the log of optimal Q*(ρ) distribution can becalculated as expectation of the log joint distribution over all thehidden and visible variables with respect to all the other latentvariables. Thus, we can get the following equation:

$\begin{matrix}{{\log\;{Q^{*}(\rho)}} = {E_{h,g}\left\lbrack {\log\;{p\left( {\mathcal{Y},h,g,{\rho ❘ \ominus}} \right)}} \right\rbrack}} \\{= {{\frac{D_{1}D_{2}N}{2}\log\;\rho} - {\frac{\rho}{2}\psi} + {\left( {a - 1} \right)\log\;\rho} - {b\;\rho}}}\end{matrix}\quad$

where

$\psi = {{\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{Y_{ij} - {\mathcal{F} \times {{}_{}^{}{h\_}_{}^{}}} - {\mathcal{W} \times {{}_{}^{}{g\_}_{}^{}}}}}_{F}^{2}}} + {N\;{{tr}\left( {\sum\limits_{\mathcal{F}}\sum\limits_{h}} \right)}} + {N\;{{{tr}\left( {\Sigma_{\mathcal{W}}\Sigma_{g}} \right)}.}}}$

Thus, Q-Distribution of ρ is proportional to

${\mathcal{Q}^{*}(\rho)} \propto {\rho^{a - 1} + {\frac{D_{1}D_{2}N}{2}\exp\left\{ {{{- b}\;\rho} - {\frac{\psi}{2}\rho}} \right\}}}$

Therefore, the best Q*(ρ) is actually a Gamma distribution, wherein theparameters become

$\overset{\_}{a} = {{a + {\frac{D_{1}D_{2}N}{2}\mspace{14mu}{and}\mspace{14mu}\overset{\_}{b}}} = {b + {\frac{1}{2}{\psi.}}}}$

2) Variational M-Step: With all the fixed Q-Distribution of latentvariables, maximize all the parameters of

(Q(h), Q(g), Q(ρ), θ). We select all terms containing parameters θ in(3) and get

${\hat{\mathcal{L}} \propto {E_{h,g,\rho}\left\lbrack {\log\;{p\left( {\mathcal{Y},h,g,{\rho\left\lbrack {\mathcal{F},\mathcal{W}} \right)}} \right\rbrack}} \right\rbrack}} = {{{\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{Y_{ij} - {\mathcal{F} \times {{}_{}^{}{}_{}^{}}} - {\mathcal{W} \times {{}_{}^{}{}_{}^{}}}}}_{F}^{2}}} + {N\;{{tr}\left( {\sum\limits_{\mathcal{F}}\sum\limits_{h}} \right)}} + {N\;{{tr}\left( {\sum\limits_{\mathcal{W}}\sum\limits_{g}} \right)}} + C}\overset{\Delta}{=}{\overset{\sim}{\mathcal{L}}.}}$

Where C represents a constant, which is independent with parameter θ. Inaddition, we can deduce

×₃ h _(i) ^(T) =F ⁽¹⁾Diag(h _(i) ^(T) F ^((h)))F ⁽²⁾ ^(T)and

×₃ g _(ij) ^(T) =W ⁽¹⁾Diag(g _(ij) ^(T) W ^((g))) W ⁽²⁾ ^(T) .

Thus,

can be rewritten as

$\hat{\mathcal{L}} \propto {{\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{{\overset{\sim}{Y}}_{ij} - {F^{(1)}{{Diag}\left( {h_{i}^{T}F^{(h)}} \right)}F^{{(2)}T}}}}_{F}^{2}}} + {N\;{{tr}\left( {\left\lbrack {\left( {F^{{(1)}T}F^{(1)}} \right) \odot \left( {F^{{(2)}T}F^{(2)}} \right)} \right\rbrack F^{{(h)}T}{\sum\limits_{h}F^{(h)}}} \right)}}}$

where{tilde over (Y)} _(ij) =Y _(ij)−

×₃ g _(ij) ^(T)

Taking derivatives of objective function

with respect to F⁽¹⁾, F⁽²⁾ and F^((h)), we have

${F^{(1)} = {{\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\overset{\sim}{Y}}_{ij}F^{(2)}{{Diag}\left( {h_{i}^{T}F^{(h)}} \right)}}}} \right\rbrack\left\lbrack {\left( {F^{{(2)}T}F^{(2)}} \right) \odot \left( {{{NF}^{{(h)}T}{\sum\limits_{h}F^{(h)}}} + {F^{{(h)}T}\overset{\sim}{H}{\overset{\sim}{H}}^{T}F^{(h)}}} \right)} \right\rbrack}\;}^{- 1}},{F^{(2)} = {{{\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\overset{\sim}{Y}}_{ij}^{T}F^{(1)}{{Diag}\left( {h_{i}^{T}F^{(h)}} \right)}}}} \right\rbrack\left\lbrack {\left( {F^{{(1)}T}F^{(1)}} \right) \odot \left( {{{NF}^{{(h)}T}{\sum\limits_{h}F^{(h)}}} + {F^{{(h)}T}\overset{\sim}{H}{\overset{\sim}{H}}^{T}F^{(h)}}} \right)} \right\rbrack}\;}^{- 1}\mspace{14mu}{and}}}$${F^{(h)} = {{{\left\lbrack {{\overset{\sim}{H}{\overset{\sim}{H}}^{T}} + {N\;\sum\limits_{h}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{h_{i}\left\lbrack {{d{iag}}\left( {F^{{(2)}T}{\overset{\sim}{Y}}_{ij}^{T}F^{(1)}} \right)} \right\rbrack}^{T}}} \right\rbrack}\left\lbrack {\left( {F^{{(1)}T}F^{(1)}} \right) \odot \left( {F^{{(2)}T}F^{(2)}} \right)} \right\rbrack}\;}^{- 1}}\ $

Note that {tilde over (H)}∈

^(K) ¹ ^(×N) is made up of all samples' h_(i) and the columnscorresponding to the same identity are equal.

can be rewritten as

$\hat{\mathcal{L}} \propto {{\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{{\overset{\sim}{Y}}_{ij} - {W^{(1)}{{Diag}\left( {g_{ij}^{T}W^{(g)}} \right)}W^{{(2)}T}}}}_{F}^{2}}} + \mspace{281mu}{N\;{{tr}\left( {\left\lbrack {\left( {W^{{(1)}T}W^{(1)}} \right) \odot \left( {W^{{(2)}T}W^{(2)}} \right)} \right\rbrack W^{{(g)}T}\Sigma_{g}W^{(g)}} \right)}}}$

Taking derivatives of objective function

with respect to W⁽¹⁾, W⁽²⁾ and W^((h)), and letting them be zero, wehave

${W^{(1)} = {{\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\hat{Y}}_{ij}W^{(2)}{{Diag}\left( {g_{ij}^{T}W^{(g)}} \right)}}}} \right\rbrack\left\lbrack {\left( {W^{{(2)}T}W^{(2)}} \right) \odot \left( {{{NW}^{{(g)}T}{\sum\limits_{g}W^{(g)}}} + {W^{{(g)}T}{GG}^{T}W^{(g)}}} \right)} \right\rbrack}\;}^{- 1}}\ $$W^{(2)} = {{{\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\hat{Y}}_{ij}^{T}W^{(1)}{{Diag}\left( {g_{ij}^{T}W^{(g)}} \right)}}}} \right\rbrack\left\lbrack {\left( {W^{{(1)}T}W^{(1)}} \right) \odot \left( {{{NW}^{{(g)}T}{\sum\limits_{g}W^{(g)}}} + {W^{{(g)}T}{GG}^{T}W^{(g)}}} \right)} \right\rbrack}\;}^{- 1}\mspace{13mu}{and}}$${W^{(g)} = {{\left\lbrack {{GG}^{T} + {N\;\sum\limits_{g}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{g_{ij}\left\lbrack {{diag}\left( {W^{{(2)}T}{\hat{Y}}_{ij}^{T}W^{(1)}} \right)} \right\rbrack}^{T}}} \right\rbrack}\left\lbrack {\left( {W^{{(1)}T}W^{(1)}} \right) \odot \left( {W^{{(2)}T}W^{(2)}} \right)} \right\rbrack}^{- 1}}\;$

3) Overall Algorithm: The entire variational EM algorithm is composed oftwo steps. In E-Step, fixing all parameters in θ, hidden variables areupdated as their Q-distributions. In M-Step, fixing all the hiddenvariable posteriors we update parameters. The two steps are conductedalternatively until a termination condition is satisfied.

Define {tilde over (H)}_(t) and G_(t) as coefficient matrices in t-step.Thus, the reconstruction errors between the original and thereconstructed image sets can be represented as

${e(t)} = {1 - \frac{{{\mathcal{Y} - {\mathcal{F}_{t} \times {{}_{}^{}\left. H \right.\sim_{}^{}}} - {\mathcal{W}_{t} \times {{}_{}^{}{}_{}^{}}}}}_{F}}{{\mathcal{Y}}_{F}}}$

When the reconstruction errors are enough small or iterative numberexceeds the defined value, the algorithm stops.

High Dimensional Model Construction

assuming a high-order observed data set {y_(ij)|y_(ij)∈

^(D) ¹ ^(× . . . ×D) ^(M) ; i=1, . . . , I, j=1, . . . , N_(i),Σ_(i)N_(i)=N}

each sample is a tensor of order M, and the linear discriminant analysismodel of higher-order tensor isy _(ij)=

×_((M+1)) h _(i) ^(T)+

×_((M+1)) g _(ij) ^(T)+ε_(ij)  (4)

Wherein h_(i)∈

^(K) ¹ , g_(ij)∈

^(K) ² ,

and

is a tensor of order (M+1), whose size is respectively D₁× . . .×D_(M)×K₁ and D₁× . . . ×D_(M)×K₂. Similar with the two order model,assuming base tensor has the structure of CP decomposition,

$\begin{matrix}{\mathcal{F} = {〚{F^{(1)},F^{(2)},\ldots\mspace{14mu},F^{(M)},F^{(h)}}〛}} \\{= {\sum\limits_{r = 1}^{R_{1}}{f_{1,r}^{(1)}\bullet\; f_{1,r}^{(2)}\bullet\mspace{14mu}\ldots\mspace{14mu}\bullet\; f_{1,r}^{(M)}\bullet\; f_{1,r}^{(h)}}}}\end{matrix}{\quad{{and}\begin{matrix}{\mathcal{W} = {〚{W^{(1)},W^{(2)},\ldots\mspace{14mu},W^{(M)},W^{(g)}}〛}} \\{= {\sum\limits_{r = 1}^{R_{2}}{w_{1,r}^{(1)}\bullet\; w_{1,r}^{(2)}\bullet\mspace{14mu}\ldots\mspace{14mu}\bullet\; w_{1,r}^{(M)}\bullet\;{w_{1,r}^{(g)}.}}}}\end{matrix}\quad}}$

The Solution of High Dimensional Model Algorithm

For higher-order model, EM algorithm is used to solve the problem,therefore the posterior of h_(i) is still a multivariate normaldistribution. The mean and covariance matrix are, respectively,

$\begin{matrix}{{\overset{\_}{h}}_{i} = {\left( {\sum\limits_{\mathcal{F}}{{+ \frac{1}{\overset{\_}{\rho}\; N_{i}}}I_{K_{1}}}} \right)a_{i}}} \\{\sum\limits_{h}{= \left( {I_{K\; 1} + {\frac{\overset{\_}{\rho}\; N}{I}\sum_{\mathcal{F}}}} \right)^{- 1}}}\end{matrix}$

where,

=F ^((h))[(F ^((m)T) F ^((m)))⊙( F ^((m)T) F ^((m)))]F ^((h)T)a _(i) =F ^((h))diag(F ^((m)T) Y _((m)) ^(i) F ^((m))).

Where ⊙ represents the Hadamard elementwise product.

Define

${{\overset{\_}{y}}_{i} = {\frac{1}{N_{i}}{\sum\limits_{j = 1}^{N_{i}}\left( {y_{ij} - {\mathcal{W} \times_{({M + 1})}g_{ij}}} \right)}}},$then Y _((m)) ^(i) represents the mode-m metricized version of y _(i),andF ^((m)) =F ^((M))

. . .

F ^((m+1))

F ^((m−1))

. . .

F ⁽¹⁾

Where

represents the Khatri-Rao product.

By computing, the posterior of g _(ij) is still a multivariate Gaussdistribution, and the mean and covariance of latent variable g_(ij) havebecomeg _(ij)=(

+1/ ρI _(K))⁻¹ b _(ij)Σ_(g)=(I _(K)+ρ

)⁻¹whereb _(ij) =W ^((g))diag(W ^((m)T) Ŷ _((m)) ^(ij) W ^((m)))

=W ^((g))[(W ^((m)T) W ^((m)))⊙( W ^((m)T) W ^((m)))]W ^((g)T)

Let ŷ_(ij)=y_(ij)−

×_((M+1))h_(i) ^(T), and Ŷ_((m)) ^(ij) represents the mode-m metricizedversion of ŷ_(ij). AndW ^((m)) =W ^((M))

. . .

W ^((m+1))

W ^((m−1))

. . .

W ⁽¹⁾

The optimal Q*(ρ) is still a Gamma distribution.

In M-step, taking derivatives of

with respect to F^((n)) (n=1, . . . , N) and F^((h))and letting them bezero, obtain

$F^{(m)} = {\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\overset{\sim}{Y}}_{(m)}^{ij}{\overset{\_}{F}}^{(m)}{{Diag}\left( {h_{i}^{T}F^{(h)}} \right)}}}} \right\rbrack{\quad{{\left\lbrack {\left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\_}{F}}^{(m)}} \right) \odot \left( {{{NF}^{{(h)}T}{\sum\limits_{h}F^{(h)}}} + {F^{{(h)}T}\overset{\sim}{H}\;{\overset{\sim}{H}}^{T}F^{(h)}}} \right)} \right\rbrack^{- 1}F^{(h)}} = {{\left\lbrack {{\overset{\sim}{H}\;{\overset{\sim}{H}}^{T}} + {N\;\sum\limits_{h}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{h_{i}\left\lbrack {{diag}\left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\sim}{Y}}_{(m)}^{{({ij})}T}F^{(m)}} \right)} \right\rbrack}^{T}}} \right\rbrack}{\quad{\left\lbrack {\left( {F^{{(m)}T}F^{(m)}} \right) \odot \left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\_}{F}}^{(m)}} \right)} \right\rbrack\text{-}{1.}}}}}}}$

In which {tilde over (Y)}_((m)) ^(ij) presents the mode-m metricizedversion of {tilde over (y)}_(ij). And {tilde over (y)}_(ij)=y_(ij)−

×_((M+1))g_(ij).

Taking derivatives of

with respect to W^((m)) (m=1, . . . , M) and W^((h)), and letting thembe zero, obtain

$\begin{matrix}{W^{(m)} = {\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\hat{Y}}_{(m)}^{ij}{\overset{\_}{W}}^{(m)}{{Diag}\left( {g_{ij}^{T}W^{(g)}} \right)}}}} \right\rbrack{\quad\left\lbrack {\left( {{\overset{\_}{W}}^{{(m)}T}{\overset{\_}{W}}^{(m)}} \right) \odot \left( {{{NW}^{{(g)}T}{\sum\limits_{g}W^{(g)}}} + {W^{{(g)}T}{GG}^{T}W^{(g)}}} \right)} \right\rbrack^{- 1}}}} & \; \\{\mspace{79mu}{and}} & \; \\{W^{(h)} = {{\left\lbrack {{GG}^{T} + {N\sum\limits_{g}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{g_{ij}\left\lbrack {{diag}\left( {{\overset{\_}{W}}^{{(m)}T}{\hat{Y}}_{(m)}^{{({ij})}T}W^{(m)}} \right)} \right\rbrack}^{T}}} \right\rbrack}{\quad{\left\lbrack {\left( {{\overset{\_}{W}}^{{(m)}T}W^{(m)}} \right) \odot \left( {{\overset{\_}{W}}^{{(m)}T}W^{(m)}} \right)} \right\rbrack^{- 1}.}}}} & \;\end{matrix}$

Recognition Classifier

For a sample Y_(p) to be classified, we judge its category bycalculating the likelihood under a model. The key point of recognitionis that samples in the same class share the same identity variable.Assuming Y_(1 . . . N) is observed samples, if Y_(p) and one particularYi belong to the same category, they shall share the same h. Forexample, in FIG. 2, the left subfigure shows Y1 and Yp belongs to thesame category and both of identity variables are h1. In this case, wenote the likelihood of observed samples and Yp as p₁(Y₁, Y₂, Y_(p)). Theright subfigure shows Yp and Y2 came from the same category with theidentity variable h2 and the likelihood can be represented as p₂(Y₁, Y₂,Y_(p)). Then we compare p₁ and p₂, and find out the largest likelihoodvalue. Thus, Yp belongs to the sample's label corresponding to thelargest likelihood. The likelihood of samples can be calculated asfollows.

Because Y_(ij) follows a matrix-variate normal conditioned on the latentvariables h and g. For convenience, we work with p(y_(ij)) instead ofp(Y_(ij)). Here, we note y_(ij):=vec(Y_(ij)). Thus, the proposed model(1) can becomey _(ij) =Fh _(i) +Wg _(ij) +e _(ij)

Where F=(F⁽²⁾⊙F⁽¹⁾)F^((h)T) and W=(W⁽²⁾⊙W⁽¹⁾)W^((g)T). These factormatrices have been learned in training phrase, which are known to therecognition processing. Thus, the likelihood of N equals

$\begin{matrix}{{p_{n}\left( y_{{1\;\ldots\mspace{11mu} N},p} \right)} = {\prod\limits_{{i = 1},{i \neq n}}^{N}{{p\left( y_{i} \right)}{p\left( {y_{p},y_{n}} \right)}}}} & (20)\end{matrix}$

Next, we illustrate the marginal distribution of n images y_(1 . . . n)that share the same identity variable h. The generative equations become

$\begin{matrix}{\begin{bmatrix}y_{1} \\y_{2} \\\vdots \\y_{n}\end{bmatrix} = {{\begin{bmatrix}F & W & 0 & \ldots & 0 \\F & 0 & W & \ldots & 0 \\\vdots & \vdots & \vdots & \ddots & \vdots \\F & 0 & 0 & \ldots & W\end{bmatrix}\begin{bmatrix}h \\g_{1} \\g_{2} \\\vdots \\g_{n}\end{bmatrix}} + \begin{bmatrix}e_{1} \\e_{2} \\\vdots \\e_{n}\end{bmatrix}}} & (5)\end{matrix}$

noting,y′=Az′+m′+e′

Taking the following probabilistic assumptionp(z′)=

(0,I)andp(e′)=

(0,σ² I)

Converts the model (5) into the form of probabilistic PCA. Thus, themarginal distribution over the observed y′ can be easily obtained byintegrating out the latent variables asp(y′)=

(m′,AA ^(T)+σ² I)  (21)

By (21), both p(y_(i)) and p(y_(p), y_(n)) (20) can be calculated andthen we can obtain p_(n)(y_(1, . . . , N, p)). Comparing allp_(n)(y_(1, . . . , N, p)) and finding out the largest value, then thesample Yp is divided into the sample's label corresponding to thelargest p_(n)(y_(1, . . . , N, p))

Experimental Effect of the Invention

The following two public databases are used to test the reconstructionexperiment.

1) The Extended Yale Face Database

(http://vision.ucsd.edu/content/yale-face-database)

All images in extended Yale database were captured from 38 individualswith total of 2414 frontal face images. There are 59-64 images for eachindividual and these images were captured in different laboratories withthe same lighting conditions. In each individual, we randomly select 40images as training sample and the remaining images as a testing sample.The resolution of all images is cropped and normalized to 32*32 pixels.Table 1 lists the comparison results between the invention and thetraditional PLDA method, including the number of free parameters ofprojections, reconstruction results, and recognition rates. In the twomethods, reduced dimension is K1=38, K2=50 on extended Yale database.The larger the reconstruction error value in the list, the better thereconstruction effect. We run each experiment ten times and record themean recognition rate and variance. The results of TPLDA algorithm aregiven under four cases: R1=R2=170,200,250. From these results, we cansee that the number of free parameters of TPLDA is lower than PLDAalgorithm with comparable recognition rates obviously.

TABLE 1 TPLDA PLDA 170 200 250 The number of 90112 25184 28244 33344free parameter Reconstruction   0.8511   0.8686   0.8754   0.8837 errorRecognition   0.9512 ±   0.9555 ±   0.9588 ±   0.9585 ± rate   0.0027  0.0034   0.0032   0.0029

FIG. 3 shows the reconstruction visual results on extended Yaledatabase. The first row shows 12 original images. The second row liststhe images reconstructed by subspace component

×₃ h_(i) ^(T), which are the same for each identity and they did notcontain illumination information. The reconstructed images

×₃ g_(ij) ^(T) are shown in the third row. We can observe that thereconstructed images by

×₃ g_(ij) ^(T) can illustrate more detailed illumination information.

2) AR Face Database

(http://rvl1.ecn.purdue.edu/aleix/aleix_face_DB.html)

The AR face database contains 126 subjects with over 4000 color images.The images of each subject are captured by changing facial expressions,illumination conditions, and occlusions. Each individual contains 26frontal view images, where the first 13 images are taken in the sameperiod and the last 13 images are taken after 2 weeks. All originalimages are RGB color images with a size of 768*576. In the experiments,all images are downsampled to 70*60 pixels. In this experiment, 100individuals are selected, whose first 13 images are used for trainingand whose last 13 images for testing.

Table 2 lists the comparison results between the invention and thetraditional PLDA method, in which reduced dimension is K1=100, K2=100.In this experiment, we record the above-mentioned results under fourcases: R1=R2=150,170,200,250. From this table, we can conclude that whenthe number of parameters is far less than the number of PLDA parameters,the recognition result is higher than the traditional method.

TABLE 2 TPLDA PLDA 150 170 200 250 The number of 840000 69000 7820092000 115000 free parameter Reconstruction    0.9165   0.9067   0.9089  0.9113    0.9144 error Recognition    0.7976 ±   0.7927 ±   0.803 ±  0.8242 ±    0.8341 ± rate    0.0038   0.0110   0.0102   0.0058   0.0053

FIG. 4 shows the reconstructed results of the invention on AR database.All images of an individual are shown in the first row in this figure.The second and third rows are the images reconstructed by

×₃ h_(i) ^(T) and

×₃ g_(ij) ^(T), respectively. It can be clearly seen from the figurethat the model of the invention can well separate the identityinformation from details information of a person.

Recognition and Classification of 3D Video Sequences

In order to verify the effectiveness of the proposed method on videosequence, the clustering effect of the method on Ballet database istested. Ballet database contains 44 video clips and these clips werecaptured for three subjects performing eight complex action patterns. Inthis experiment, each video is clipped into subgroup as an image set.Then we can obtain 713 image sets. In training phase, we randomly selectone-third samples to train and the left samples are used for testing.Table 3 lists the recognition results of the six algorithms. The reduceddimension is 30, 50 and 70, respectively. From the table, we can seethat the proposed algorithm TPLDA can obtain the best results.

TABLE 3 TPLDA Algorithm 30 50 70 PCA 0.8407 0.8323 0.8365 CP  0.8394 ±0.01455 0.8216 ± 0.0085 0.8115 ± 0.0111 TBV-DR 0.8476 ± 0.0125 0.8430 ±0.0129 0.8361 ± 0.0137 TRPDA 0.8553 (3, 3, 3) 0.8491 (5, 5, 3)  0.8532(8, 8, 5)   STDA 0.8550 (5, 5, 3) 0.8512 (10, 10, 5) 0.8491 (15, 15, 10)TPLDA 0.8598 ± 0.0178 0.8523 ± 0.0149 0.08551 ± 0.0170 

The above contents are only the preferable embodiments of the presentinvention, and do not limit the present invention in any manner. Anyimprovements, amendments and alternative changes made to the aboveembodiments according to the technical spirit of the present inventionshall fall within the claimed scope of the present invention.

What is claimed is:
 1. A processing method for high-order tensor data, comprising: dividing the high-order tensor data into three parts comprising a shared subspace component, a personality subspace component and a noise part; wherein, the shared subspace component and the personality subspace component respectively represent the high-order tensor data as a group of linear combination of a tensor base and a vector coefficient; using a variational EM method to solve the tensor base and the vector coefficient; and designing a classifier to classify a plurality of test samples by comparing an edge distribution of the plurality of samples.
 2. The processing method for high-order tensor data according to claim 1, further comprising forming a two dimensional model assuming an observed data set y={Y_(ij)}_(i=1,j=1) ^(I,N) ^(i) ; wherein the observed data set consist of N images of I individuals and each individual has Ni images; Yij represents a jth image of an ith individual and Yij is modeled as formula (1): Y _(ij)=

×₃ h _(i) ^(T)+

×₃ g _(ij) ^(T) +M+E _(ij)  (1) in the formula (1),

×₃ h_(i) ^(T) or

×₃ g_(ij) ^(T) is a generalization of vector representation for a given sample,

and

are factor tensors consisting of a plurality of base matrices in a shared subspace and an individual subspace, respectively; ×₃ represents a product of tensor and matrix (vector), h_(i)∈

^(K) ¹ is a shared latent variable crossing all of the plurality of images Y_(i1), . . . , Y_(iN) _(i) of object/class i and g_(ij)∈

^(K) ² represents the vector coefficient of the jth sample of the ith object in the individual space; M is a mean matrix of the high-order data; E_(ij) is a stochastic noise term, each element in the stochastic noise term follows the normal distribution

(0, σ²).
 3. The processing method for high-order tensor data according to the claim 2, wherein, the vector of the formula (1) is formula (2) vec(Y _(ij))=Fh _(i) +Wg _(ij)+vec(M)+vec(E _(ij))  (2) wherein, columns f_(k) ₁ and w_(k) ₂ in the matrix F and W is respectively k1, k2 face of tensor

and

, are used to restrict the base tensor by CP decomposition of tensor; wherein, $\mathcal{F} = {{〚{{\lambda;F^{(1)}},F^{(2)},F^{(h)}}〛} = {\sum\limits_{r = 1}^{R_{1}}{\lambda_{r}{f_{1,r}^{(1)} \circ f_{1,r}^{(2)} \circ f_{1,r}^{(h)}}\mspace{14mu}{and}}}}$ $\mathcal{W} = {{〚{{\eta;W^{(1)}},W^{(2)},W^{(g)}}〛} = {\sum\limits_{r = 1}^{R_{2}}{\eta_{r}{{w_{1,r}^{(1)} \circ w_{1,r}^{(2)} \circ w_{1,r}^{(g)}}.}}}}$
 4. The processing method for high-order tensor data according to the claim 3, wherein, Q(h, g, ρ) is any joint distribution over a plurality of hidden variables, a lower bound on a likelihood of the observed data is formula (2), $\begin{matrix} \begin{matrix} {{\mathcal{L}(\Theta)} = {\log\;{p\left( {y❘\Theta} \right)}}} \\ {\geq {\int_{h}{\int_{g}{\int_{\rho}\;{{Q\left( {h,g,\rho} \right)}\;\log\frac{p\left( {y,h,g,\rho} \right)}{Q\left( {h,g,\rho} \right)}{dhdgd}\;\rho}}}}} \\ {= {{E_{h,g,\rho}\left\lbrack {\log\;{p\left( {{y❘h},g,\rho} \right)}} \right\rbrack} + {E_{h}\left\lbrack {\log\;\frac{p(h)}{Q(h)}} \right\rbrack} +}} \\ {{E_{g}\left\lbrack {\log\;\frac{p(g)}{Q(g)}} \right\rbrack} + {E_{\rho}\left\lbrack {\log\;\frac{p(\rho)}{Q(\rho)}} \right\rbrack}} \\ {\overset{\Delta}{=}{\hat{\mathcal{L}}\left( {{Q(h)},{Q(g)},{Q(\rho)},\Theta} \right)}} \end{matrix} & (3) \end{matrix}$ wherein, Q(h, g, ρ)=Q(h)Q(g)Q(ρ).
 5. The processing method for high-order tensor data according to claim 1, wherein, a high-order observed data set is {y_(ij)|y_(ij)∈

^(D) ¹ ^(× . . . ×D) ^(M) ; i=1, . . . , I, j=1 . . . , N_(i), Σ_(i)N_(i)=N}; each sample is a tensor of order M, and a linear discriminant analysis model of the higher-order tensor is formula (4) y _(ij)=

×_((M+1)) h _(i) ^(T)+

×_((M+1)) g _(ij) ^(T)+

_(ij)  (4); wherein h_(i)∈

^(K) ¹ , g_(ij)∈

^(K) ² ,

and

are a tensor of order (M+1), wherein a size of

and

is respectively D₁× . . . ×D_(M)×K₁ and D₁× . . . ×D_(M)×K₂; wherein, the base tensor has a structure of CP decomposition, $\begin{matrix} {\mathcal{F} = {〚{F^{(1)},F^{(2)},\ldots\mspace{14mu},F^{(M)},F^{(h)}}〛}} \\ {= {\sum\limits_{r = 1}^{R_{1}}{f_{1,r}^{(1)} \circ f_{1,r}^{(2)} \circ \mspace{11mu}\ldots\mspace{11mu} \circ f_{1,r}^{(M)} \circ f_{1,r}^{(h)}}}} \end{matrix}$ and $\begin{matrix} {\mathcal{W} = {〚{W^{(1)},W^{(2)},\ldots\mspace{14mu},W^{(M)},W^{(g)}}〛}} \\ {= {\sum\limits_{r = 1}^{R_{2}}{{w_{1,r}^{(1)} \circ w_{1,r}^{(2)} \circ \mspace{11mu}\ldots\mspace{11mu} \circ \; w_{1,r}^{(M)} \circ \; w_{1,r}^{(g)}}.}}} \end{matrix}$
 6. The processing method for high-order tensor data according to claim 5, wherein, for a higher-order model, an EM algorithm is used to solve a problem, a posterior of h_(i) is a multivariate normal distribution; a mean matrix and a covariance matrix are, respectively, $\begin{matrix} {{\overset{\_}{h}}_{i} = {\left( {\sum\limits_{\mathcal{F}}{{+ \frac{1}{\overset{\_}{\rho}\; N_{i}}}I_{K_{1}}}} \right)a_{i}}} \\ {\sum\limits_{h}{= \left( {I_{K\; 1} + {\frac{\overset{\_}{\rho}\; N}{I}\sum_{\mathcal{F}}}} \right)^{- 1}}} \end{matrix}$ where,

=F ^((h))[(F ^((m)T) F ^((m)))⊙( F ^((m)T) F ^((m)))]F ^((h)T) a _(i) =F ^((h))diag(F ^((m)T) Y _((m)) ^(i) F ^((m))), where ⊙ represents the Hadamard elementwise product; ${{\overset{\_}{y}}_{i} = {\frac{1}{N_{i}}{\sum\limits_{j = 1}^{N_{i}}\left( {y_{ij} - {\mathcal{W} \times_{({M + 1})}g_{ij}}} \right)}}},$ wherein, Y _((m)) ^(i) represents a mode-m metricized version of y _(i), and F ^((m)) =F ^((M))

. . .

F ^((m+1))

F ^((m−1))

. . .

F ⁽¹⁾ where

represents the Khatri-Rao product; by computing, the posterior of g _(ij) is a multivariate Gauss distribution, and a mean and a covariance of a latent variable g_(ij) are g _(ij)=

+1/ρI _(K))⁻¹ b _(ij) Σ_(g)=(I _(K)+ρ

)⁻¹ where b _(ij) =W ^((g))diag(W ^((m)T) Ŷ _((m)) ^(ij) W ^((m)))

=W ^((g))[(W ^((m)T) W ^((m)))⊙( W ^((m)T) W ^((m)))]W ^((g)T) wherein, ŷ_(ij)=y_(ij)−

×_((M+1))h_(i) ^(T), and Ŷ_((m)) ^(ij) represents a mode-m metricized version of ŷ_(ij); and W ^((m)) =W ^((M))

. . .

W ^((m+1))

W ^((m−1))

. . .

W ⁽¹⁾ an optimal Q*(ρ) is a Gamma distribution; in M-step, taking derivatives of

with respect to F^((n)) (n=1, . . . , N) and F^((h)), to be zero, obtain $F^{(m)} = {\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\overset{\sim}{Y}}_{(m)}^{ij}{\overset{\_}{F}}^{(m)}{{Diag}\left( {h_{i}^{T}F^{(h)}} \right)}}}} \right\rbrack{\quad{{\left\lbrack {\left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\_}{F}}^{(m)}} \right) \odot \left( {{{NF}^{{(h)}T}{\sum\limits_{h}F^{(h)}}} + {F^{{(h)}T}\overset{\sim}{H}\;{\overset{\sim}{H}}^{T}F^{(h)}}} \right)} \right\rbrack^{- 1}F^{(h)}} = {{\left\lbrack {{\overset{\sim}{H}\;{\overset{\sim}{H}}^{T}} + {N\;\sum\limits_{h}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{h_{i}\left\lbrack {{diag}\left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\sim}{Y}}_{(m)}^{{({ij})}T}F^{(m)}} \right)} \right\rbrack}^{T}}} \right\rbrack}{\quad{\left\lbrack {\left( {F^{{(m)}T}F^{(m)}} \right) \odot \left( {{\overset{\_}{F}}^{{(m)}T}{\overset{\_}{F}}^{(m)}} \right)} \right\rbrack\text{-}{1.}}}}}}}$ wherein, {tilde over (Y)}_((m)) ^(ij) represents a mode-m metricized version of {tilde over (y)}_(ij); and {tilde over (y)}_(ij)=y_(ij)−

×_((M+1)) g_(ij), taking derivatives of

with respect to W^((m)) (m=1, . . . , M) and W^((h)), to be zero, obtain $\begin{matrix} {W^{(m)} = {\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{{\hat{Y}}_{(m)}^{ij}{\overset{\_}{W}}^{(m)}{{Diag}\left( {g_{ij}^{T}W^{(g)}} \right)}}}} \right\rbrack{\quad\left\lbrack {\left( {{\overset{\_}{W}}^{{(m)}T}{\overset{\_}{W}}^{(m)}} \right) \odot \left( {{{NW}^{{(g)}T}{\sum\limits_{g}W^{(g)}}} + {W^{{(g)}T}{GG}^{T}W^{(g)}}} \right)} \right\rbrack^{- 1}}}} & \; \\ {\mspace{79mu}{and}} & \; \\ {W^{(h)} = {{\left\lbrack {{GG}^{T} + {N\sum\limits_{g}}} \right\rbrack^{- 1}\left\lbrack {\sum\limits_{i = 1}^{I}{\sum\limits_{j = 1}^{N_{i}}{g_{ij}\left\lbrack {{diag}\left( {{\overset{\_}{W}}^{{(m)}T}{\hat{Y}}_{(m)}^{{({ij})}T}W^{(m)}} \right)} \right\rbrack}^{T}}} \right\rbrack}{\quad{\left\lbrack {\left( {W^{{(m)}T}W^{(m)}} \right) \odot \left( {{\overset{\_}{W}}^{{(m)}T}W^{(m)}} \right)} \right\rbrack^{- 1}.}}}} & \; \end{matrix}$
 7. The processing method for high-order tensor data according to claim 6, wherein, for a sample Y_(p) to be classified, in a category by calculating a likelihood under a model; a key point of recognition is the plurality of samples in a same class share a same identity variable; assuming Y_(1 . . . N) is a plurality of observed samples, if Y_(p) and one particular Yi belong to a same category, Y_(p) and one particular Yi share a same h.
 8. The processing method for high-order tensor data according to claim 7, wherein, Y1 and Yp belongs to the same category and both of identity variables are h1; a likelihood of the plurality of observed samples and Yp as p₁ (Y₁, Y₂, Y_(p)), Yp and Y2 came from the same category with the identity variable h2 and the likelihood is represented as p₂ (Y₁, Y₂, Y_(p)); then comparing p₁ and p₂, and finding out a largest likelihood value; where Yp belongs to a sample's label corresponding to the largest likelihood; the likelihood of the plurality of samples is calculated as follows; Y_(ij) follows a matrix-variate normal conditioned on a latent variables h and g; proposing a model y _(ij) =Fh _(i) +Wg _(ij) +e _(ij) wherein, y_(ij):=vec(Y_(ij)), F=(F⁽²⁾⊙F⁽¹⁾)F^((h)T) and W=(W⁽²⁾⊙W⁽¹⁾)W^((g)T); a plurality of factor matrices have been learned in a training phrase, the plurality of factor matrices are known to a recognition processing; the likelihood of N equals $\begin{matrix} {{{p_{n}\left( y_{{1\;\ldots\mspace{11mu} N},p} \right)} = {\prod\limits_{{i = 1},{i \neq n}}^{N}{{p\left( y_{i} \right)}{p\left( {y_{p},y_{n}} \right)}}}};} & (20) \end{matrix}$ illustrating a marginal distribution of n images y_(1 . . . n) that share the same identity variable h; Forming generative equations $\begin{matrix} {{\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = {{\begin{bmatrix} F & W & 0 & \ldots & 0 \\ F & 0 & W & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ F & 0 & 0 & \ldots & W \end{bmatrix}\begin{bmatrix} h \\ g_{1} \\ g_{2} \\ \vdots \\ g_{n} \end{bmatrix}} + \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{n} \end{bmatrix}}},} & (5) \end{matrix}$ wherein, y′=Az′+m′+e′ taking the following probabilistic assumption p(z′)=

(0,I) and p(e′)=

(0,σ² I) converting the model (5) into the form of probabilistic PCA to obtain the marginal distribution over the observed y′ by integrating out the latent variables as p(y′)=

(m′,AA ^(T)+σ² I)  (21); using (21), calculating both p(y_(i)) and p(y_(p), y_(n)) and in (20) and then obtaining p_(n)(y_(1, . . . , N, p)); comparing all p_(n)(y_(1, . . . , N, p)) and finding out a largest value, then dividing the sample Yp the sample's label corresponding to the largest p_(n)(y_(1, . . . , N, p)). 